AlphaFold2 models indicate that protein sequence determines both structure and dynamics

AlphaFold 2 (AF2) has placed Molecular Biology in a new era where we can visualize, analyze and interpret the structures and functions of all proteins solely from their primary sequences. We performed AF2 structure predictions for various protein systems, including globular proteins, a multi-domain protein, an intrinsically disordered protein (IDP), a randomized protein, two larger proteins (> 1000 AA), a heterodimer and a homodimer protein complex. Our results show that along with the three dimensional (3D) structures, AF2 also decodes protein sequences into residue flexibilities via both the predicted local distance difference test (pLDDT) scores of the models, and the predicted aligned error (PAE) maps. We show that PAE maps from AF2 are correlated with the distance variation (DV) matrices from molecular dynamics (MD) simulations, which reveals that the PAE maps can predict the dynamical nature of protein residues. Here, we introduce the AF2-scores, which are simply derived from pLDDT scores and are in the range of [0, 1]. We found that for most protein models, including large proteins and protein complexes, the AF2-scores are highly correlated with the root mean square fluctuations (RMSF) calculated from MD simulations. However, for an IDP and a randomized protein, the AF2-scores do not correlate with the RMSF from MD, especially for the IDP. Our results indicate that the protein structures predicted by AF2 also convey information of the residue flexibility, i.e., protein dynamics.

. Proteins models used in the present work. 1 Number of amino acid residues. 2 The MSA hits from the BFD 3 (Big Fantastic Database). The MSA hits include those that match the protein partial segments. 3 Mean ± SD for per-residue pLDDT, IUPRED2 and RMSF values. 4 Two chains of the heterodimer are PAS-A (108 AA) and kinase (287 AA) domain sequences, respectively. 5 Both chains of the homodimer have the same sequence of 146 AA. 6 The Pearson's correlation coefficient (PCC) between pLDDT and RMSF scores, the slope and intercepts of the linear fitting between them are also listed; note that as pLDDT and the AF2 scores in this work are anticorrelated, and the PCC values are the negative of those shown in the Figures. www.nature.com/scientificreports/ Methods Protein structure models. AF2 (V2.0.1) is used for structure predictions with the required databases downloaded from the AF2 GitHub repository 3 . Table 1 summarizes the protein models used in the present work. The AF2 structure models of these proteins are shown in Fig. S1 of the Supplementary Information (SI). All protein sequences can be found in the Appendix of the SI.
MD simulation. Molecular dynamics (MD) simulations were performed using the NAMD program 37 . The protein atoms were typed according to the CHARMM force field 38,39 (c36m) and a modified TIP3P model 40 was used for the solvent water molecules. All hydrogen atoms were added using the HBuild function of CHARMM 40,41 . The proteins were placed in unit cells with at least 12 Å of solvent water molecules added to each edge. The solvation and neutralization (using Na + and/or Cl − ) were carried out by the Solvate and Autoionization packages of VMD 42 . After solvation and neutralization, the whole system was optimized by 50,000 steps. Next, the temperature of the system was increased to 300 K with a rate of 0.001 K/timestep, followed by constant-pressure, constant-temperature (NPT) equilibration at a pressure of 1 atm and temperature of 300 K maintained by Langevin piston controls.The SHAKE algorithm was applied to fix the bond lengths involving hydrogen atoms. The simulations were conducted using a timestep of 2 fs and a non-bonded interaction cutoff switching of 9 to 11 Å. The protonation states of all titratable residues of the protein were determined by ProPka 43 at neutral pH of 7. For each protein, the system was equilibrated for 10 ns, followed by a 100 ns production run with the trajectory saved every 10 ps. The analyses were based on the production runs (10 k frames each).

Analysis of the MD trajectories.
We used the R package bio3d 44,45 to analyze the 100 ns production run MD trajectories: the root mean square deviation (RMSD), root mean square fluctuation (RMSF) calculations and the mass-weighted principal component analysis (PCA) for the movement of the protein backbone atoms. Distance variations (DV) were calculated from the MD simulations in order to gain insight into the predicted aligned errors (PAE) provided by AF2.. With the assumption that the PAE map contains the dynamics information of the protein, the DV can be regarded as a 1D simplification of the PAE (see below). In the definition and estimation of the PAE, AF2 performs "alignments" between the predicted structure and the "true" (or "real") structure. The PAE between the residues x and y, i.e., the (x, y) element of the PAE matrix is estimated as the error of the x residue if the y residue is aligned ( Fig. 1) 3,29 . Here, for two residues x and y, we define the distance deviation (DV) as: where r xy is the distance between the Cɑ atoms of residues x and y monitored through the MD trajectory. IQR is the interquartile range. The DV also has a unit of Å. If we assume the residue y is fixed, then the calculated IQR could be regarded as the (1D) variation of residue x. Use of IQR in Eq. (1) can avoid the biases from outliers (extreme long or short r xy ). Note that the PAE matrix is asymmetric 3,29,46 as for any (x, y) pair, the uncertainty assigned to x may be different than that to y. However, the DV matrix is symmetric because the distance variations neglected the residue compositions. We consider the dynamic assumption valid if the DV map matches the PAE map-that is, the PAE map from AF2 originates from the protein dynamics.
Other tools and data availability. Besides AF2 structure prediction and the MD simulations, IUPRED2 36 was used for prediction of the intrinsic disorder content based on the protein sequence. VMD 42 was used to plot structural figures and generate animations of the principal movement (PC1) of the proteins during the MD simulations. PyMol (2.4.1) was used to plot the residue dynamic cross correlation matrix (DCCM) analyzed by bio3d 44 . All figures were prepared using R. The R codes used for the principal component analysis, and the DV calculations (together with the heatmap plot) are available from the GitHub repository: https:// github. com/ haobo guo/ AF2. Res. Flex. This repository also contains the coordinates (PDB format) of the AF2 structures used in this work, and the animations of the primary movements (PC1) of the two domain protein GNE (vibration between the two domains) and the homodimeric MerR-family protein (opening-and-closing dynamics) from (1) DV = IQR r xy , Figure 1. PAE vs DV. (a) PAE(x,y) is the error of residue x between the predicted (blue) and the "true" (red) models when residue y is aligned (x' could be observed anywhere in the yellow sphere). (b) DV(x,y) is the distance variation between residues x and y monitored from MD, and the DV can be regarded as the (1D) variations in the movements of residue x with the position of residue y fixed. www.nature.com/scientificreports/ Mycobacterium tuberculosis, calculated by PCA. All heatmaps were plotted using the heatmap.2 function from the gplots package of R. The B-factors from X-ray crystallography were used when available. Because the B-factor can be compared with RMSF via, where RMSF is derived from MD: a higher B or RMSF value corresponds to higher flexibility of a residue. Similarly, the B-factors could be inferred from an ensemble of configurations detected by NMR 47 . The square root of B-factors was used in the comparisons with RMSF (see in "Results and Discussion").
The pLDDT scores are listed in the B-factor column of the AF2 protein models in the AlphaFold database 4,46 . However, we found the pLDDT scores from the AF2 protein models (Fig. 2) usually show an anti-correlation with the RMSF values calculated from MD. Furthermore, the pLDDT scores exhibit an opposite trend to what the B-factors or RMSF indicate. Here, for consistency, we define a new parameter, the AF2-score, as the normalized reversed pLDDT scores, i.e., The calculated AF2-scores are highly correlated with the RMSF (see Results and Discussion section), indicating the models generated by AF2 also contain information about the residue flexibility. For PAE and DV heatmaps, the color gradually changes from the highest (white) to lowest (dark green) in 256 bins, only exception is for the DV heatmap of PAS-Kinase, the color bar is manually adjusted (Fig. 4 caption) to highlight the similarity of both matrices.

Results and discussion
AF2-scores represent residue flexibility. AF2 provides the per-residue pLDDT (predicted local distance difference test) scores for each residue of the final model. This score is in the range of [0,100] and represents the confidence of the predicted structure compared to the "true" (ground truth) structure. We used the AF2-scores, as a reversed normalization of the pLDDT scores (see Methods section), and found that the AF2- www.nature.com/scientificreports/ scores are highly correlated with the residue flexibility. In reality, even the "true" X-ray crystallographic structure represents an ensemble of protein configurations embedded in the crystal lattices. The flexibilities of the atoms are usually recorded as the temperature factors (or B-factors) in the PDB files. From MD simulations, the residue flexibility can be calculated as the root-mean-square fluctuation (RMSF, see "Methods"). Figure 2 compares various measures related to residue flexibility for the AF2 models built from the sequences of four proteins. The first protein (Fig. 2a, 133 AA) is the apo-form of the full length lanmodulin (LanM) protein. LanM is an interesting protein that could shed light on rare earth element sequestration [48][49][50] . The LanM protein solved by NMR binds to three yttrium ions and has its N-terminus (residues M1 to A22) cleaved 51 (PDB ID 6MI5). For the apo LanM, MD simulation shows high-flexibility of the N-terminal tail, consistent with the AF2-scores. The RMSF of all residues is highly correlated with the AF2-scores (Pearson's correlation coefficient, or PCC = 0.843, P = 0). However, the disorder prediction by IUPRED2 interprets that the N-terminus is in a well-ordered state (the median disorder content of 0.07), contradicting both the AF2 score and the RMSF calculated from MD.
The second protein studied (300 AA) is a dehalogenase recently sequenced from the bacterium Delftia acidovorans strain D4B 52 . Because D. acidovorans had been cultured in presence of perfluorooctanoic acid (PFOA) 52 , this dehalogenase might be involved in defluorination of PFOA (or other fluorinated compounds). For this model, the RMSF from MD correlates well with the AF2-score for the dehalogenase (PCC = 0.941, P = 0), as shown in Fig. 2b. Moderate correlation is observed between RMSF and the IUPRED2 scores (PCC = 0.183, P = 0).
The third protein is part of the human PAS-A-domain containing serine/threonine kinase (1323 AA, Q96RG2 in the AF2 database). Here the PAS-A domain sequence (M130-R237, M130 is mutated from the original P130) is used to build the PAS-A domain model. The PAS-A domain is speculated to regulate the kinase activity by sensing environmental stimuli. In general, PAS domains are found in all three domains of life and have a well-conserved tertiary structure, albeit with diverse sequences 53 . It is shown in Fig. 2c that the high flexibility of the central region of the PAS-A domain (residues 45 to 70) revealed by the RMSF profile is also reproduced by the high AF2-scores (low pLDDT scores). However, the IUPRED2 score does not correlate with the RMSF for this protein.
The fourth protein is an antifreeze protein (AFP, Fig. 2d, 66 AA), which also has a high-resolution X-ray crystallographic structure (PDB entry 1HG7 54 , resolution 1.15 Å), such that the experimental B-factors are available for comparisons. In this example, the AF2 score has a near-perfect correlation with the RMSF (PCC = 0.972, P = 0). However, the IUPRED score does not show positive correlation with the RMSF. The crystal lattice effect in the X-ray structure may lead to rigid body motions which affect the B-factor profile (square-root used, see Eq. 2), but it also shows good correlation with the RMSF (PCC = 0.578, P = 0).
The data for all four models in Fig. 2 indicate that the AF2 scores can be used to predict the residue flexibilities, as measured by the RMSF from MD simulations. However, for these proteins, the disorder predictors (such as IUPRED2) for these proteins do not seem to predict residue flexibility. It had been shown that combining both the flexibility (B-factor) and the disorder contents the protein sequences can be classified into four different categories: low-B-factor ordered, high-B-factor ordered, short-disordered and long-disordered regions 55 . This also explains why the IUPRED2 scores and the RMSF values do not correlate well. The above results indicate that in addition to solving 3D structures from amino acid sequences, AF2 accurately predicts residue flexibilities from the pLDDT scores (or AF2-scores). It should be pointed out that all protein sequences in Fig. 2 have relatively large MSA depths (> 1000)一here, the MSA depth is defined as the number of aligned or partially aligned sequences from the BFD 3 (see Table 1 for the MSA depth of all models used in the present work). PAE from AF2 is associated with protein dynamics. The predicted aligned errors (PAE) derived from the predicted template modeling (pTM) scores clearly show that the residues within the same domain exhibit lower PAEs than the inter-domain residues. The AF2 model that serves as the tutorial for the PAE is the human GNE protein, a two-domain, bifunctional enzyme playing a key role in sialic acid biosynthesis 56 . Using this model structure, an MD simulation was performed and the distance variations (DVs) among the Cɑ atoms of residues were computed to compare with the PAE map.
For multi-domain systems, all-atom structural superposition based on the minimization of the overall RMSD 57 may be inappropriate. For these systems, a principal component analysis can be used to examine the relationship between the domains 58 . The AF2 profile of the two domain GNE protein (Fig. 3a) shows that the linker (residue 381 to 401) between the two domains has high AF2-scores, together with both the C-and N-termini, indicating high flexibility of these regions. Instead of all-atom structural superposition, we used a domain-specific approach: first, only the residues of domain 1 (1 to 380) were superimposed and the RMSF values for domain 1 were acquired based on this superposition; then only the residues of domain 2 (402 to 722) are superimposed and the RMSF values for domain 2 calculated. However, the RMSF values of the linker region (residues 381 to 401) were averaged from both superpositions. This domain-specific superposition approach yielded RMSF of the whole protein highly consistent with the AF2-scores (Fig. 3a). This analysis is in line with our hypothesis that AF2 correctly predicts the residue flexibility via the pLDDT or AF2-scores (Fig. 2). We also calculated the RMSF values using an all-atom superposition approach as those for the one-domain proteins (Fig. 2), however, this approach cannot correctly address the flexibility, especially that of the linker between the domains (Fig. S2 in the SI). The RMSF profile of the LanM protein shown in Fig. 2a is obtained from an all-atom superposition. Similar to the domain-specific approach used for GNE, because LanM has a long disordered N-terminus (residues 1-22), if we average the RMSFs from superposition with or without the N-terminus (residues 23 to 133), the AF2-scores have a better correlation to the RMSF plot, as shown in Fig. S3 in the SI.
MD simulation studies often use the RMSD (root mean square deviation) profile with respect to the first frame of the trajectory, to determine whether the system has equilibrated. However, in the two-domain system, we observed that the principal movement (PC1 from PCA) corresponds to the anti-correlated movement of the www.nature.com/scientificreports/ two domains (Fig. 3b and Supplementary movie). With the large amplitude interdomain movement, the RMSD profile of this protein does not converge to a plateau. This has been observed previously for the MerR homodimer system regarding opening-and-closing dynamics 59 . We also monitored the interdomain center-of-mass (COM) distances from the MD trajectory, observing large amplitude fluctuations during the 100 ns MD with the interdomain COM varying from 47.5 to 60 Å (Fig. S4 in the SI). Interestingly, the RMSD profiles monitored using the two extreme interdomain COM configurations as the reference states exhibited mirror symmetry owing to the anticorrelation of the RMSD profiles. This hidden symmetry from MD simulations of vibrating systems had also been observed in both MerR and CurR homodimer proteins 59 .
We calculated the distance variations (DV) among the Cɑ atoms of all residues. The DV map and PAE map from AF2 (Fig. 3c) show highly consistent patterns: the distance variations (or predicted errors) of residues within the same domain are relatively small; whereas the variations/errors of residues from different domains are relatively large. For this two domain protein, the maximal PAE reported from AF2 is 31.8 Å, and the maximal calculated DV is 18.0 Å. This may be due to that the DV is estimated as the IQR, i.e., the Cɑ-Cɑ distance at the 75% quantile subtract that at the 25% quantile, which may be roughly half of difference between maximal and minimal Cɑ-Cɑ distance. Also the DV calculation is a 1D simplification of the real PAE (Fig. 1), which may also give errors to the estimation. However, the consistent trends between PAE and DV maps (PCC = 0.732, P = 0, Fig. 3c) indicate that PAE originates from protein dynamics, and that the Evoformer neural network of AF2 decodes this dynamics information (encoded in the protein sequences) through multisequence alignment.
The PAE and DV heatmaps for the models in Fig. 2 (see Fig. S5) show statistically significant correlations similar to Fig. 3c. These results substantiate the usefulness of the PAE's predicted by AF2 for capturing protein dynamics.
Large proteins. It remains a challenge for AF2 to model extremely large proteins, such as the human Titin protein (34,350 AAs) which include a long IDPR of over 2100 AAs 60 . The AlphaFold database of the human structurome does contain 3D models for smaller fragmental (1400 AA) of the Titin 4 . Other proteins containing residues as large as 2,180 AA's (with no structural homologues) have also been reported with the TM-score of 0.96 3 . Here, we have also modeled two large proteins (> 1000 amino acids), including one with considerable disordered regions. Figure 2c represents the result for the PAS-A domain protein, which is only the regulatory part of the PAS-A domain containing kinase (PAS-kinase) from H. sapiens 53 . We modeled the structure of the full length PASkinase, which contains 1,323 AAs. This structural model has been also modeled by the AF2 team 4 (AF2 entry: Q96RG2) and can be obtained from the AF2 database at (https:// www. alpha fold. ebi. ac. uk/ entry/ Q96RG2). In the present work, both the AF2-scores and the PAE map (Fig. 4a) indicate the PAS-kinase model indicates two structural domains: the PAS-domain that comprises both the PAS-A domain (residue 130 to 237, Fig. 2c) and PAS-B domain (residue 238 to 401) as well as the kinase-domain (residues 892 to 1269). However, the other regions of the full PAS-kinase are mostly disordered (see the PAS-kinase structure in Fig. S1). Both the AF2 and IUPRED2 scores correlate well with the RMSF calculated from the MD simulation (Fig. 4a). The PAE map also indicates the existence of two structure regions (or domains): the PAS-domain and the kinase domain, which is also supported by the DV map (Fig. 4c). Moreover, the interdomain regions in the DV map have relatively small distance variations, which is consistent with the PAE map and reflects the interactions between the two domains.
We also analyzed the ice nucleation protein inaZ from Pseudomonas syringae (UniProt code P06620). This 1,200 amino acid protein enables the microbial organism to facilitate the crystallization of supercooled water 61 . The ice nucleating properties of P. syringae are key for their biological function 62,63 , and confer them a role in cloud glaciation and precipitation 64,65 , as well as in artificial snow making 66 . Fragments of the inaZ protein had been modeled 67 but the structure of the full-length protein has never been predicted. The AF2 structure (Fig. S1 in the SI) indicates that the inaZ has a N-terminal domain in ɑ/β fold (residues M1 to A110) and a ca. 30 nmlong domain constituted by antiparallel β-strands (residues Q171 to K1189), followed by C-terminal tail in coil state (residues P1180 to K1200).
For the inaZ protein, the AF2-scores are also strongly correlated with the RMSF from the MD simulation (Fig. 4b). The PAE map from AF2 and DV map from MD (Fig. 4d) both indicate the existence of two separated segments in the inaZ protein. The AF2 profile (Fig. 4c) shows repeating peaks from the β-strands, which is Figure 3. The dynamic nature of the predicted aligned error from AF2 and the dynamics of a two-domain protein GNE (AF2 entry: Q9Y223). (a) The AF2 scores correlate well with the RMSF values calculated with a domain-specific approach (see in main text); the IUPRED2 scores also show a correlation to the RMSD. An arbitrary unit (AU) is applied as both AF2 and RMSF are normalized. (b) Residue cross correlations calculated from the MD trajectory show that residues within the same domain tend to have correlated movements (i.e., moving toward the same directions) whereas residues from different domains tend to have anticorrelated movements (or moving toward opposite directions). This is also reflected in the principal movement, which is the vibration between the two domains (Supplementary movie). (c) The predicted assigned error (PAE) map is calculated via AlphaFold2-pTM and it can also be found in the AF2 PAE tutorial (the coloring scheme is slightly different) (https:// alpha fold. ebi. ac. uk/ entry/ Q9Y223); and the distant variation (DV) map calculated from a 100-ns MD trajectory. Both heatmaps indicate that the residues within the same domain have a relatively small PAE (left) or DV (right), whereas the PAE/DV for residues from different domains are relatively large. The color histograms of the PAE or DV values are also plotted in the color bars: Both PAE and DV histogram have a peak at short distances, but in PAE histogram there is an additional peak at long distance (ca. 27 Å), indicating AF2 yields larger interdomain errors than the MD simulation.  www.nature.com/scientificreports/ www.nature.com/scientificreports/ also reflected in the RMSF profile. Although the magnitudes of these peaks are different, the peak positions are precisely consistent. This consistency is also shown in other systems (Figs. 2 and 3). For the other large protein PAS-kinase (Fig. 4a), overall correlation between AF2 and RMSF profiles is observed, but not at the finger-printlike accuracy of inaZ, which may be owing to the IDPRs in the PAS-kinase (Fig. S1 in the SI). Similarly, the PAE and DV maps are also consistent, but the PAE from AF2 modeling generally propose larger error ranges than the DV from MD. Not only because DV can be regarded as a 1D simplification as PAE (see Methods), this may also be derived from the PAE evaluation, which is based on the calculation of the predicted template modeling scores of the predicted residues compared to the imaginary "true" models 3 .

Multimers.
The modular protein-protein interaction network (PIN) in a living cell suggests that the protein functions are dependent on their interactions 68,69 . The AlphaFold-Multimer 8 has been incorporated into AF2 to model the protein multimers一both homomers and heteromers一that is applicable to analyze the interactions and dynamics of the PPIs, at least in silico. RoseTTAFold 7 was also applied with a similar approach to model cellular core complexes involved in key cellular functions such as transcription, translation and DNA repair 9 . The multimer models from AF2 or/and RoseTTAFold are extremely useful for understanding the PINs and modular protein functions. Known PINs are mainly based on curations of the experimental results such as those from the yeast two-hybrid experiments 70 . These networks are binary, i.e., the strength, or affinity, of the two interacting proteins are unknown 71 . The multimer models also provide the opportunity to simulate the protein-protein interaction strengths. Here, we modeled and simulated both a heterodimer and a homodimer to test the trends we observed in the monomers, as shown in Fig. 5. In the heterodimer model, we used the sequences of the PAS-A domain and kinase domain as two independent entries, and applied AlphaFold-Multimer to construct the dimer structure. In this model, consistent with the monomer models, the AF2-scores correlate well with the RMSF from MD, indicating that the AlphaFold-Multimer also captures the residue flexibility (Fig. 5a). In addition, both the PAE and DV heatmaps (Fig. 5c) show interactions between the PAS-A and kinase "proteins", consistent with the full PAS-kinase model. Therefore, beyond the multimer structures, AF2 can also predict the dynamics characteristics associated with the protein-protein interactions.
We used the sequence of a probable MerR family transcriptional regulatory protein from Mycobacterium tuberculosis (UniProt ID O53384). The monomer of this protein is available at the AF2 database at https:// alpha fold. ebi. ac. uk/ entry/ O53384. Using AlphaFold-multimer, the homodimer structure of this protein was constructed. Again, the AF2-score profile is consistent with the RMSF from MD (Fig. 5b), and the strong interactions between the two chains of this homodimer are shown in both the PAE and DV heatmaps (Fig. 5d), agreeing well with the known structures and dynamics of the Hg(II)-dependent MerR homodimer 59 . Moreover, the opening-and-closing dynamics of this homodimer was shown as the largest amplitude movement (PC1 from the principal component analysis of the MD trajectory), consistent with the previous simulations of the MerR family homodimers 59 .
Intrinsically disordered and randomized proteins. Intrinsically disordered proteins (IDPs) or protein regions (IDPRs) are abundant in all organisms [72][73][74] . Proteins with structures deposited in the protein data bank 75 , even proteins with high-resolution X-ray structures, also contain significant disorder contents in their sequences 60,76 . It has been shown that the pLDDT scores provided by AF2 can also be applied to detect disorder 19 . For example, pLDDT scores lower than 50 are often indications of disorder in a protein 25,31 . The human structruome constructed by AF2 covers 58% of residues with a confident prediction (pLDDT > 70) 4 , indicating the prevalence of IDPs and IDPRs in proteomes 77 .
We examined a fully disordered protein, NVJP-1 from the marine sandworm Nereis virens 78 . For this protein, no MSA hit has been found by AF2. The NVJP-1 protein is fully disordered, reflected by the IUPRED2 scores and the pLDDT scores (Fig. 6a). For clarity, the pLDDT scores are divided by 100 and are not normalized. Consistent with the IUPRED2 trend, all residues in NVJP-1 have a median pLDDT score of 42.8, and an IQR of 7.3, with 334 out of 387 residues demonstrating pLDDT scores lower than 50.0, suggesting disorder for these residues 31 . The RMSF profile of the NVJP-1 protein does not correlate with the AF2 (or pLDDT) scores, however, it shows a moderate correlation with the IUPRED scores (Fig. 6a). In addition, as indicated in the PAE and DV maps (Fig. 6c), all-atom superposition for the RMSF calculation is insufficient to estimate the flexibility of the residues due to large distance variations among the residues (also see Figs. 3A and S2).
We also compared the PAE and DV maps of NVJP-1 (Fig. 6c), which exhibited significant similarity (PCC = 0.529, P = 0). Unlike other globular proteins (Figs. 3c and S1 in the SI), the PAE or DV maps indicated that all residues in the protein have considerably high PAEs and DVs to other residues, even to their closely adjacent residues. The definition of "disordered" is as ambiguous as the definition of "ordered" 79 , given that rapid configurational dynamics in any protein continually occur in the cells 80 . Moreover, many IDPs may acquire folded structures upon interaction with a variety of partners, particularly, in the crowded cellular environment 81 . Here, we show that the AF2 models can be used to qualify the states of intrinsic disorder in proteins: large PAEs (and DVs) among adjacent residues serve as a signature of disorder.
We randomly constructed the amino acid sequence of a protein using the methods described previously 60 . For the randomized protein, AF2 did not find any MSA hits, similar to NVJP-1. However, unlike the NVJP-1 model that did not show any folded elements in its structure, the randomized protein contained folded regions (Fig. S1), indicating that fully disordered IDPs such as NVJP-1 do not occur by chance. This is in line with the previous results that randomized proteins have roughly half folded and half unfolded regions based on the disorder content calculations 60 . For both NVJP-1 and the randomized protein ( Fig. 6a/b) www.nature.com/scientificreports/ Figure 6. A fully disordered IDP, NVJP-1 (a, c) and a random protein (b, d). The RMSF (normalized), pLDDT/100 scores, and IUPRED2 scores for the IDP (a) and the random protein (b). The PAE and DV maps for the IDP (c) and random protein (d). Note: in 6 (a, c) the AF2-scores are replaced by pLDDT/100 for clarity. www.nature.com/scientificreports/ from MSA (Table 1). As shown in Fig. 6d, the PAE/DV maps (PCC = 0.482, P = 0) of the random protein are featureless, and resemble those of NVJP-1. A recent study starts with the random "hallucination" sequences that also yields featureless 2D contact maps 15 . However, the contact maps can be optimized by interactions of Markov chain Monte Carlo simulations, and the optimized contact maps corresponded to highly featured protein folds validated by X-ray crystallography 15 . Therefore, the neural network used in AF2 and RoseTTAFold contains sufficiently rich structure and dynamics information for useful protein engineering.
Other models, protein disorder, residue flexibility and AF2 scores. Besides the 11 models discussed above, we analyzed 10 more models from the budding yeast (Saccharomyces cerevisiae), as listed in Table S1 and Fig. S6. To select these proteins, first, all yeast proteins (ca. 6000) were classified 10 groups based on the quantiles of the Hirsch-index (H-index) centrality of the protein-protein interaction network (PIN, adopted from Ref. 71 ). The PINs, similar to other natural networks, have power-law distributions of the node degrees 82 . The H-index 83 (originally proposed for quantifying the research performances of researchers) centrality is a measure that connect both node degree and coreness 84 . Using these models, together with the models shown in Table 1 and Fig. 1, we calculated the overall agreement between the residue flexibility (RMSF from MD) and the per-residue pLDDT scores and the IUPRED2 disorder contents, as shown in Fig. 7. A strong correlation between the mean IUPRED2 scores and mean RMSF is observed (Fig. 7b), but that between mean pLDDT and mean RMSF is less significant (Fig. 7b). However, the per-residue pLDDT scores are highly correlated to the perresidue RMSF (e.g., Figs. 2, 3, 4, 5, 6), yet the IUPRED scores do not correlate to the RMSF and in certain cases even contradict the RMSF scores, for example, models c/d (Fig. 2c/2d) and model g (Fig. 4b) in the main text, and models for the yeast proteins MRPL20, ALG5 and VTI1 in the SI. Protein disorder is ubiquitous. The pLDDT scores have been considered as a predictor for the residue disorder contents: the residues with the pLDDT scores lower than 50 may be located in the IDPR 19 . The aim of the present work is to add an additional time dimension to the 3D protein structures, i.e., to explore the protein dynamics, which can also be understood from the protein residue flexibilities. Intrinsic protein disorder is strongly related to protein flexibilities 55 . It was suggested that it is the protein flexibility that should carry the term "intrinsic", but not the disorder 85 . Here, we show that in all models listed in Table 1 the AF2 scores (or the reversed pLDDT scores) significantly correlate with the RMSF scores obtained from MD simulations.
The success of AF2 is derived from translating the evolutionary knowledge gained from MSA to distance contact matrices. The ensembles of sequences, in principle, also represent the ensembles of structures, among which the structural variations tend to aggregate toward the evolutionarily stable states. Therefore, the multisequence alignment or multistructure alignment should propagate along similar trajectories, which had been verified in both principal component analysis 86 and the elastic network models 87 . It is, therefore, possible to decode the dynamics information encoded in the protein sequences from the evolutionary history, i.e., MSA.  Tables 1 and S1.